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ABSTRACT Like other species of Drosophila, Drosophila pseudoobscura has a distinct bias toward the 
usage of C- and G-ending codons. Previous studies have indicated that this bias is due, at least in part, to 
natural selection. Codon bias clearly differs among amino acids (and other codon classes) in Drosophila, 
which may reflect differences in the intensity of selection on codon usage. Ongoing natural selection on 
synonymous codon usage should be reflected in the shapes of the site frequency spectra of derived states 
at polymorphic positions. Specifically, regardless of other demographic effects on the spectrum, it should 
be shifted toward higher values for changes from less-preferred to more-preferred codons, and toward 
lower values for the converse. If the intensity of natural selection is increased, shifts in the site frequency 
spectra should be more pronounced. A total of 33,729 synonymous polymorphic sites on Chromosome 2 in 
D. pseudoobscura were analyzed. Shifts in the site frequency spectra are consistent with differential in- 
tensity of natural selection on codon usage, with stronger shifts associated with higher codon bias. The 
shifts, in general, are greater for polymorphic synonymous sites than for polymorphic intron sites, also 
consistent with natural selection. However, unlike observations in D. melanogaster, codon bias is not re- 
duced in areas of low recombination in D. pseudoobscura] the site frequency spectrum signal for selection 
on codon usage remains strong in these regions. However, diversity is reduced, as expected. It is possible 
that estimates of low recombination reflect a recent change in recombination rate. 
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The relative usage of synonymous codons varies among genes within 
an organism. In some organisms (e.g., humans), this variation may 
largely reflect base composition variation across the genome (Bernardi 
et al. 1985; Kliman and Bernal 2005). In many organisms, however, 
natural selection appears to directly influence codon usage, with pos- 
itive correlations between the levels of codon bias and gene expression 
that are consistent with selection on transcriptional efficiency and/or 
fidelity (Chavancy et al 1979; Gouy and Gautier 1982; Ikemura 1985; 
Akashi 1994; Moriyama and Powell 1997; Kliman and Henry 2005; 
Plotkin and Kudla 2011). This relationship was first reported for 
Drosophila by Shields et al (1988), and numerous studies using di- 
verse approaches have supported the hypothesis that natural selection 
influences codon usage in several Drosophila species (Kliman and Hey 
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1993; Akashi and Schaeffer 1997; Carlini and Stephan 2003; Haddrill 
et al 2011; De Proce et al. 2012). Effective weak selection on codon 
usage requires a sufficient effective population size to overcome the 
effects of genetic drift [although see Hershberg and Petrov (2008), 
who point out that selection on codon usage may not be always be 
weak]. However, evidence has been emerging that natural selection 
may even be influencing codon usage in humans (Lavner and Kotlar 
2005), mammals more generally (Yang and Nielsen 2008), and other 
vertebrates (Doherty and Mclnerney 2013). 

Among the studies supporting the hypothesis that selection 
influences codon usage are (1) those that show shifts in the site 
frequency spectra (SFS) of derived states at synonymous polymorphic 
sites, such that the SFS is shifted toward higher values for changes to 
more preferred codons (Akashi and Schaeffer 1997; Kliman 1999; 
Llopart et al 2008); and (2) those that show significantly reduced 
codon bias in areas of the genome with very low recombination rates 
(Kliman and Hey 1993; Hey and Kliman 2002). The latter is consistent 
with the expectation that natural selection will be less effective in the 
absence of recombination due to linkage disequilibrium among targets 
of selection; these are often in repulsion due to independent emer- 
gence of selectively favored or disfavored mutations on different cop- 
ies of a chromosome in a population (Hill and Robertson 1966; 
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Felsenstein 1974; McVean and Charlesworth 2000; Comeron et al. 
2008). Limited recombination among targets of selection is also 
predicted to lead to reduced diversity, whether by selective sweeps 
that wipe out standing variation (Maynard Smith and Haigh 1974; 
Gillespie 2000) or by background selection against continually 
arising deleterious mutations that prevent diversity from accumu- 
lating in the first place (Charlesworth et al. 1993; McVean and 
Charlesworth 2000). This prediction was most notably confirmed 
by Begun and Aquadro (1992) in D. melanogaster and has been 
corroborated by subsequent studies in Drosophila (Comeron et al. 
2012) and other organisms [reviewed by Nachman (2002) and 
Stephan (2010)]. 

These earlier studies often relied on estimates of recombination 
rate derived by fitting recombination maps to physical maps, using 
a variety of line- or curve-fitting approaches. The advent of "next- 
generation" DNA-sequencing methods has allowed investigators to 
identify numerous single-nucleotide polymorphisms that can be used 
in testcrosses to directly estimate recombination rate at a fine scale. 
Cirulli et al (2007) directly estimated recombination rates across 
a section of the D. pseudoobscura X chromosome and found consider- 
able heterogeneity in recombination rate. Kulathinal et al (2008) 
showed that estimates of recombination rate at finer scales (i.e., with 
more densely spaced markers) correlated better with diversity, a find- 
ing that suggests that fine- scale recombination rates are more reliable 
when they can be obtained. McGaugh et al (2012) extended this work 
to three complete chromosomes, not only confirming heterogeneity, 
but showing that estimates could be replicated using crosses of differ- 
ent strains. Importantly, McGaugh et al (2012) also sequenced 10 
additional D. pseudoobscura genomes (along with those of other close 
relatives) and observed the predicted correlation between recombina- 
tion rate and diversity. Stevison and Noor (2010) observed a similar 
relationship between fine-scale recombination rate and diversity in 
closely related D. persimilis. 

Although previous polymorphism-based studies on natural selec- 
tion codon usage in D. pseudoobscura have relied on hundreds of 
variable sites, the Chromosome 2 data from McGaugh et al (2012) 
provide tens of thousands of variable sites. These data, therefore, allow 
us to much more thoroughly investigate the effects of natural selection 
on codon usage in this species. In addition to providing increased 
statistical power to detect subtle effects, it becomes possible to sub- 
divide data and retain statistical power to test for differential effects. In 
particular, we investigate whether variation among amino acids in the 
degree of codon bias can be explained by variation in selection in- 
tensity. Furthermore, because of its generally higher recombination 



rate, analysis with D. pseudoobscura provides a valuable contrast to D. 
melanogaster. 

As expected, we observe a fairly strong correlation between 
recombination rate and diversity. Although other composition-biasing 
influences may be influencing patterns of diversity, most notably G/C- 
biased gene conversion (Marais et al. 2001, 2003; Singh et al. 2005), 
a comparison of SFSs of synonymous and intron sites shows that the 
SFS shifts are significantly stronger at synonymous sites. Therefore, 
although selectively neutral influences may be partially responsible for 
the observed SFS shifts, the data support an influence of natural 
selection on codon usage. Furthermore, differences among subsets 
of codons in the SFS shifts are consistent with the differential influ- 
ence of natural selection. We do not, however, find that selection on 
codon usage is consistently weaker in areas with lower estimates of 
recombination rate in D. pseudoobscura. 

MATERIALS AND METHODS 
Data set 

Chromosome 2 was recently sequenced in 10 strains of D. pseudoobs- 
cura, along with one strain of the outgroup D. lowei y using Illumina 
platforms (McGaugh et al. 2012) (National Center for Biotechnology 
Information sequence read archive accession numbers SRA044960.1, 
SRA044955.2, and SRA044956.1). The reference strain of D. pseu- 
doobscura (Richards et al. 2005) v2.9 was also included in analyses. 
The D. pseudoobscura strains represent nearly isogenic lines generated 
by full-sibling matings over several generations (Machado et al. 2002; 
McGaugh et al. 2012). Population structure is very limited in D. 
pseudoobscura (Schaeffer and Miller 1992; Noor et al. 2000), making 
it unlikely that the choice of strains, including the reference strain, 
would influence patterns of diversity. Of the 55 possible pairwise 
contrasts, 10 of the 11 that showed the lowest pairwise difference at 
synonymous sites included the reference strain. The reference strain 
also contributes the smallest number of derived singletons to the 
polymorphism data set, suggesting that, if anything, there are fewer 
sequencing errors in the reference strain than in the others. 

Genes 

Genes were excluded from analysis if any of the following criteria were 
met in the reference sequence: the annotated start codon was not 
AUG; the annotated stop codon was not UGA, UAG, or UAA; any 
amino acid codon was incompletely resolved; there was a premature 
stop codon; or the intron/exon boundaries were noncanonical. A total 
of 3548 genes met all inclusion criteria. 
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Recombination rates 

Estimates were obtained from a pair of testcrosses (Flagstaff and Pikes 
Peak) described in McGaugh et al. (2012). The Flagstaff testcross 
involved two strains from Arizona bearing the "Arrowhead" arrange- 
ment on chromosome 3; the Pikes Peak testcross involved two lines 
from New Mexico bearing the "Pikes Peak" arrangement on chromo- 
some 3. As these authors noted, recombination rates estimated from 
the independent testcrosses were very similar along Chromosome 2 
(Figure 1). Estimates were obtained for 140 segments in the Flagstaff 
cross and for 158 segments in the Pikes Peak cross. Except near the 
ends of the chromosome, most positions are represented in both re- 
combination maps. Unless otherwise stated, for all analyses, the aver- 
age of the two recombination rates was used for these positions. 

Site inclusion criteria 

For each sequence, excluding the reference strain, a base was 
considered unresolved ('N') if either the phred consensus quality score 
was below 30 or the depth of coverage in the alignment was below 
15 x . To be included in the analysis, an intron site or complete codon 
(i.e., all three sites) had to be completely resolved in all 12 strains. A 
total of 35,376 codons meeting these criteria displayed synonymous 
polymorphism in D. pseudoobscura. Of these, 33,755 segregated two 
character states; 26 of these loci were in regions for which no recom- 
bination rate has been obtained (McGaugh et al. 2012) and were 
excluded from analyses involving recombination. A total of 68,332 
intron sites meeting the inclusion criteria were polymorphic in 
D. pseudoobscura; of these, 65,266 segregated two character states. 

Statistical analyses 

Statistical analyses were performed using R v3.0.1 for Mac OS X (the R 
Foundation for Statistical Computing), implemented in RStudio 
vO.97.551 (RStudio 2013). 

Inference of preferred codons 

Following Hey and Kliman (2002), preferred codons were inferred by 
factor analysis on the 59 codons that encode the 18 degenerate amino 
acids. Only genes that used all 18 amino acids were included in the 
factor analysis. The primary factor was polarized, such that values 
correlated positively with Chi/L (Shields et al 1988) and negatively 
with effective number of codons (Wright 1990); both are measures of 
uneven codon usage, with Chi/L increasing and ENC decreasing as 
codon usage becomes less even. Codons that loaded positively on the 
primary factor were considered "preferred." The degree of preference 
(or "preference score") of each codon was denned as its loading score 
(Llopart et al 2008). 

For each synonymous polymorphic site, Apref was defined as 
prefderived - pref ances trai> where pref derived and pref ancestra i are the co- 
don preference scores for the derived and ancestral states, respectively, 
as inferred by parsimony (Llopart et al 2008). For convenience, a site 
is defined as P— >U if Apref is negative, and as U— >P if Apref is 
positive. This designation is not clear-cut for amino acids with de- 
generacy above 2; for example, a mutation may substitute one pre- 
ferred codon with another preferred codon. However, for our 
analyses, the polarity of the fitness effect is more important than 
the assignment of "preferred" or "unpreferred." 

Analysis of diversity 

SFS of derived preferred vs. derived unpreferred codons (or similar 
contrasts) were compared using parametric and nonparametric tests 
(Akashi and Schaeffer 1997), as well as a permutation test (Llopart et al 
2008; described in context under Site frequency spectra relative to Apref). 



Table 1 Codon preference scores 

Codon Amino Acid Pref Score Codon Amino Acid Pref Score 



GCC 


3 la 


0.571 


CTG 


leu 


0.697 


GCG 


sis 


0.076 


CTC 


leu 


0.316 


GCT 


ala 


-0.401 


TTG 


leu 


-0.434 


GCA 


a la 


-0.491 


CTA 


leu 


-0.443 


CGC 




0.501 


CTT 


leu 


-0.458 


CGG 


ara 


0.163 


TTA 


leu 


-0.572 


CGT 


ara 


-0.090 


AAG 


Ivs 

iyo 


0.700 


AGG 




-0.188 


AAA 


iyb 


-0.700 


CGA 


am 
ai y 


-0.279 


TTC 


phe 


0.534 


AGA 


am 
a\ y 


-0.471 


TTT 


phe 


-0.534 


AAC 




0.461 


ccc 




0.425 


AAT 


asn 


-0.461 


CCG 


pro 


0.21 1 


GAC 


asp 


0.392 


CCT 


pro 


-0.404 


GAT 


asp 


-0.392 


CCA 


pro 


-0.442 


TGC 


t-yb 


0.304 


TCC 


ser 


0.287 


TGT 




-0.304 


AGC 


ser 


0.270 


CAG 


a I n 
y in 


0.656 


TCG 


ser 


0.228 


CAA 


n I n 
y in 


-0.656 


AGT 


ser 


-0.297 


GAG 


nil i 

y iu 


0.724 


TCT 


ser 


-0.351 


GAA 


nli i 
y iu 


-0.724 


TCA 


ser 


-0.456 


GGC 


9'y 


0.430 


ACC 


thr 


0.435 


GGG 


n l\/ 


-0.083 


ACG 


thr 


0.204 


vJVJ I 


9'y 


—0.222 


ACT 


thr 
LI 1 1 


U.JJ/ 


GGA 


g'y 


-0.291 


ACA 


thr 


-0.416 


CAC 


his 


0.331 


TAC 


tyr 


0.421 


CAT 


his 


-0.331 


TAT 


tyr 


-0.421 


ATC 


ile 


0.584 


GTG 


val 


0.453 


ATT 


ile 


-0.345 


GTC 


val 


0.245 


ATA 


ile 


-0.405 


GTA 


val 


-0.489 








GTT 


val 


-0.492 



Pref, preference. 



RESULTS 

Inference of preferred codons 

All C- ending codons, and all but three G- ending codons, are preferred in 
D. pseudoobscura. All A-and T-ending codons are unpreferred (Table 1). 

Estimates of diversity, divergence, and codon bias 

Synonymous sites were counted taking into account the degeneracy of 
codons. For example, a fourfold degenerate codon would provide one 
synonymous site, whereas a twofold degenerate codon would provide 
1/3 of a synonymous site. A total of 543,985 synonymous sites and 
2,746,629 intron sites were completely resolved in all 12 strains. The 
Watterson (1975) estimator of synonymous 6 in D. pseudoobscura was 
0.0222 across all sites; average synonymous divergence from D. lowei 
was 0.0760. Diversity and divergence varied among amino acids 
(Table 2), with twofold degenerate amino acids having higher values 
of both. Watterson's estimator of intron 6 in D. pseudoobscura was 
0.0085; divergence from D. lowei was 0.0297. The lower values for 
intron sites probably reflect a larger denominator, as each intron site 
was counted as one full site. 

Association of diversity and codon bias with 
recombination rate 

Using average recombination rate (Flagstaff and Pikes Peak), loci 
were placed into 25 recombination categories: 0.00—0.25 cM/Mb, 
>0.25-0.50 cm/Mb, >5.75-6.00 cM/Mb, and >6.00 cM/Mb. (No 
sites are in regions with > 0.25— 0.50 cM/Mb.) As shown in Figure 2 A, 
synonymous diversity increases with recombination rate (defined by the 
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Table 2 Estimates of synonymous divergence and diversity 


A 


Amino 






^syn 

(2) c 








'S 


AGO. 


N a 

1 v syn 


S b 

^syn 


D d 


e w /bp e 


D/bp f 


ta/b 


All 


543,985.0 35,376 


33,729 41,360 


0.022203 


0.076032 


-C 


ala 


50,380.0 


2661 


2513 


3126 


0.018033 


0.062048 




arg 


48,644.0 


2558 


2365 


3047 


0.017954 


0.062639 




asn 


13,894.3 


1282 


1281 


1621 


0.031502 


0.116667 




asp 


15,130.0 


1303 


1302 


1835 


0.029403 


0.121282 


3 
O 

E 


cys 


5,124.6 


500 


500 


576 


0.033311 


0.112398 


> 


glu 


18,283.6 


1733 


1731 


1936 


0.032361 


0.105887 


o 
c 


gin 


12,420.7 


1034 


1034 


1 151 


0.028422 


0.092668 


$ 


giy 


40,903.0 


2600 


2427 


2932 


0.021702 


0.071682 




his 


7051.3 


593 


593 


813 


0.028713 


0.115298 




ile 


31,135.4 


1833 


1782 


2028 


0.020100 


0.065135 




leu 


91,350.6 


5619 


5139 


6313 


0.021001 


0.069107 


B 


lys 


17,437.3 


1500 


1500 


1817 


0.029370 


0.104202 




phe 


11,413.0 


1358 


1356 


1634 


0.040624 


0.143170 




pro 


35,924.0 


2137 


1975 


2481 


0.020310 


0.069062 


-Q 

ra 


ser 


44,903.3 


2865 


2735 


3331 


0.021784 


0.074182 


QJ 
-C 


thr 


43,553.0 


2406 


2257 


2775 


0.018861 


0.063715 




tyr 


9123.6 


851 


849 


1062 


0.031845 


0.116401 




val 


47,313.0 


2543 


2390 


2882 


0.018351 


0.060913 


a) 
> 



0.035 



Number of synonymous sites in D. pseudoobscura. 

Number of synonymous polymorphic sites in D. pseudoobscura. 

Number of synonymous polymorphic sites segregating two codons in 
^ D. pseudoobscura, and for which a recombination rate estimate is available. 

Number of divergent synonymous sites between the reference strain of 

D. pseudoobscura v2.9 (Richards et a/. 2005) and D. lowei for codons fully 

resolved in all D. pseudoobscura strains and in D. lowei. 
j Watterson (1975) estimator of synonymous theta in D. pseudoobscura. 

Synonymous divergence per base pair. 



upper bound of the category) until about 2 cM/Mb, at which point 
diversity levels off [r = 0.800, 21 degrees of freedom (d.f.), 1 -tailed 
P < 10~ 5 ]. A very similar relationship is observed for intron diversity 
(r = 0.703, 21 d.f, 1-tailed P < 10" 4 ) (Figure 2B). However, there is no 
clear relationship between recombination rate and the frequency of 
optimal codons [F op , a measure of preferred codon usage (Sharp and 
Devine 1989)] (r = 0.209, 21 d.f., 1-tailed P = 0.158) (Figure 2C). 

Synonymous diversity varied spatially along Chromosome 2. Using 
the 142 segments denned by the Flagstaff recombination map (the 140 
regions with recombination rate estimates, along with the two external 
regions), there is obvious heterogeneity in levels of synonymous 
diversity (Figure 3A). However, there is also considerable variation in 
the number of sites analyzed, with as few as 14 sites to as many as 22,635. 
With a median of 2897 sites per segment, 18 of the 142 segments had 
fewer than 1000 sites, and 29 had more than 5000 sites. When only the 
latter are plotted to minimize sampling error (Figure 3B), diversity is 
clearly reduced at the two ends of the chromosome, and it is also 
somewhat suppressed in the center of the chromosome. As expected 
based on the analysis of recombination rate classes, there is a positive 
correlation between recombination rate in each segment and synony- 
mous diversity (r = 0.533, 2-tailed P = 0.0026 for the regions with at least 
5000 sites). The correlation, although slightly weaker (r = 0.350), 
remains highly significant (P < 10~ 4 ) when all regions are included. 
For intron sites, the correlation between diversity and recombination 
rate is somewhat weaker, but still significant, for regions with at least 
5,000 sites (r = 0.268, n = 120, P = 0.0029). However, when regions with 
fewer sites are included, the correlation is lost (r = —0.029). In contrast 
to diversity, variation along Chromosome 2 in F op appears to be negli- 
gible (Figure 3, C and D). Results were qualitatively similar, includ- 
ing the significant correlation between recombination rate and either 
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Figure 2 Diversity and codon bias relative to recombination rate. 
Points are plotted at the upper end of the recombination rate range 
(e.g., at 0.25 for 0-0.25 cM/Mb); the red point represents sites in 
regions with recombination rate above 6 cM/Mb. (A) Synonymous di- 
versity measure is the Watterson (1975) estimator of 0. (B) Intron di- 
versity. (C) Codon bias measure is F op (Sharp and Devine 1989). 

synonymous or intron diversity, using the Pikes Peak recombination 
map (data not shown). 

Site frequency spectra relative to Apref 

Under a constant-iV e Wright-Fisher neutral model, the relative fre- 
quency of sites with ^-derived states is lid, such that the expected 
mean d is (k — 1)1 a, where k is the sample size and a is the sum of 1, 
1/2, l/(k — 1) (Fu 1995). For k- 11 D. pseudoobscura sequences, 
we expect 3.414 derived states/site. Overall, our data for synonymous 
polymorphic sites indicate a shift of the SFS toward lower values, with 
a mean of 2.59 derived states/site. For intron sites, the mean was 2.279 
derived states/site. However, there are noticeably raised tails in the 
SFSs, likely due in part to ancestral state misassignment (ASM; dis- 
cussed in Impact of ASM). 
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Figure 3 Diversity and codon bias along Chromosome 2. (A) Diversity 
in all recombination map segments. Segments upstream (FLint_upout) 
and downstream (FLint_dnout) of the recombination map are also 
shown; for these, there is no corresponding recombination rate 
estimate. (B) Diversity in segments with at least 5000 synonymous 
sites. (C) Codon bias (F op ) in all segments. (D) Codon bias in segments 
with at least 5,000 synonymous sites. Recombination rates (cM/Mb 
from the Flagstaff testcross) are shown for reference. 
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Figure 4 Sites frequency spectrum for synonymous polymorphic sites. 
Shown are sites that segregate two codons and fall within a region for 
which recombination rate was estimated. "P to U," a change to a more 
unpreferred codon; "U to P," a change to a more preferred codon. 



If natural selection is acting on codon usage, then the SFSs for 
P — > U and U — > P changes should be shifted relative to each other; i.e., 
the SFS for U— >P changes should be shifted toward higher values 
(Akashi and Schaeffer 1997). Consistent with natural selection, the 
synonymous SFS was shifted toward higher values for the 9918 
U^P sites (mean = 3.395) than for the 23,811 P^U sites (mean = 
2.258) (Figure 4). However, the SFS for the U— >P sites is not right- 
shifted relative to the expectations of a constant-iV e Wright-Fisher 
neutral model; this may indicate a demographic effect on diversity, 
such as historically increasing N e (Tajima 1989a). The difference be- 
tween U— >P and P— >U sites is highly significant using either a 
1 -tailed Student's f-test (t.test in R, t = 33.715, P < 10" 15 ) or a 1 -tailed 
Mann- Whitney [/-test (following Akashi and Schaeffer (1997)) 
(wilcox.test in R, P < 10" 15 ). 

Given that singletons (i.e., sites with 1 or 10 individuals carrying 
the inferred derived state) are the most likely polymorphic sites to 
represent sequencing errors, the analyses were repeated excluding 
5045 U^P and 14,912 P^U singletons. The SFS shift remained 
highly significant (U— >P mean = 4.456; P— >U mean = 3.767; t = 
16.807, P < 10" 15 ; [/-test, P < 10" 15 ). Under a constant-iV e Wright- 
Fisher neutral model, the expected frequency of derived states per 
nonsingleton site would be 4.374. Thus, the SFS for U — > P sites was 
shifted slightly toward higher values, whereas that for P — > U sites was 
shifted toward lower values. 

Analyses were repeated for each amino acid. In every case, the SFS 
was shifted toward higher values for U — > P sites than for P — > U sites, 
although there was variation among amino acids in the extent of the 
shift (see Variation among amino acids in the SFS shift). The results 
were qualitatively unchanged when singletons were excluded (Table 3). 

Llopart et al. (2008) proposed an alternative test, based on the 
prediction that natural selection should lead to a positive relationship 
between d and Apref. Computationally, the sum of d x Apref can 
serve as a proxy for a correlation or regression coefficient; therefore, 
1 -tailed statistical significance can be estimated from the proportion of 
random permutations of d vs. Apref that lead to a higher sum of 
products. The test is significant for all amino acids (including or 
excluding singletons; see Table 4), as well as for all sites pooled. 

Llopart et al. (2008) also proposed a modification to the test to 
correct for ASM. Essentially, a simple Bayesian approach was sug- 
gested to calculate a posterior odds ratio of correct assignment by 
parsimony to ASM. The likelihood ratio was based on the neutral 
SFS in a constant-N population: likelihood ratio = (k — d)ld, where 
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Table 3 Shifts in site frequency spectra for each amino acid 
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^ N, number of polymorphic sites. 
Mean frequency of derived states/site. 
P-values are for 1 -tailed tests. 



d is the frequency of derived states assuming parsimony, k is the 
number of individuals in the sample (here, 11), and k — d is the 
number of derived states when parsimony is incorrect {i.e., when there 
is ASM). The prior odds ratio is the relative probability of no mutation 
on the branch connecting the outgroup to the base of the ingroup 
coalescent to the probability of a single mutation on that branch: 



Prior odds ratio = 




where D and 0 are estimates of divergence and diversity, respec- 
tively. Thus, following Llopart et al. (2008), the posterior odds ratio 
is calculated as: 

k-df k-l \~ l 
Posterior odds ratio = — - — I f)— — —q I . (2) 

The permutation test can then be performed after randomly 
assigning an ancestral state for each site using the posterior odds 
ratio. As shown in Table 4, this usually slightly more conservative 
test (using distinct estimates of 0 and D for each amino acid) 
remains significant for all analyses. 

Following up on the recombination analyses, we compared the 
slopes of the regression lines (d on Apref) for sites in four recombi- 
nation classes [following McGaugh et al. (2012)]: 0-0.5, >0.5-3.0, 
> 3.0— 6.0, and >6.0 cM/Mb. For the individual testcrosses (Flagstaff 
and Pikes Peak), we compared sites in 0 -recombination regions to 
sites found elsewhere. The slope was not lower for the 0—0.5 cM/ 
Mb class (b = 0.864, n = 1164) than for the three other classes 
(>0.5-3.0 cM/Mb: b = 0.648, n = 14,088; >3.0-6.0 cM/Mb, b = 
0.672, n = 13,817; >6.0 cM/Mb, b = 0.832, n = 4660). For the Flagstaff 
testcross, the slope was significantly lower for the 0-recombination 
regions (b = 0.164, n = 217) than for higher-recombination regions 
(b = 0.694, n = 33,226; P = 0.0002, 1 -tailed Tukey-Kramer test). 
This result was not mirrored, however, for the Pikes Peak testcross 



(0-recombination regions: b = 0.903, n = 1467; higher recombination 
regions: b = 0.679, n = 32,262). In fact, mean d for U— >P changes 
exceeded mean d for P — > U changes in all five regions with recombi- 
nation estimates of 0, the difference ranging from 0.804 (interval 152, 
n = 258) to 2.011 (interval 151, n = 133). One-tailed Mann- Whitney 
[/-tests were significant after sequential Bonferroni correction (Rice 
1989) for four of the five contrasts. The permutation test on each of 
the five regions produced similar results; all five tests were significant 
after Bonferroni correction assuming parsimony, and three were sig- 
nificant after Bonferroni correction when allowing for ASM. There- 
fore, if the slope of the regression line corresponds to effectiveness of 
selection on codon usage, there is only equivocal evidence for an effect 
of low recombination in D. pseudoobscura. 

Variation among amino acids in the SFS shift 

As noted previously, for all amino acids, the average frequency of 
derived states at synonymous polymorphic sites was greater for P — > U 
changes than for U — > P changes. This result is consistent with natural 
selection on synonymous codon usage. However, it is also consistent 
with the G/C-biased gene conversion (although recent work by 
Comeron et al. (2012) suggests that that this does not occur in D. 
melanogaster). In the latter, individuals heterozygous for a preferred 
and an unpreferred codon will usually be segregating a pair of purines 
or a pair of pyrimidines at the synonymous site (usually the third 
position of a codon). If heteroduplex intermediates generated during 
crossing-over tend to resolve toward G or C, then this process could 
lead to shifts in the SFS even if crossing- over is not, itself, mutagenic. 

In the standard genetic code, there are 16 T/C-ending synonymous 
codon pairs and 13 A/G-ending synonymous codon pairs. Although 
the degree of bias varies, C- or G- ending codons are usually used 
disproportionately (Table 5). The C-ending codon is always used 
disproportionately, although barely so for Asp (50.5% C-ending). 
For the A/G pairs, the G-ending codon is used disproportionately 
in all cases except for Gly, where unpreferred GGG is used less often 
than unpreferred GGA. 
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Table 4 Shifts in site frequency spectra for each amino acid, 
Monte Carlo analyses 3 
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b Permutation test of Llopart et al. (2008). 
All P-values are for 1 -tailed tests. 

A reported estimate of 0 indicates that none of 100,000 data permutations led 
to a higher value of the test statistic. 

For all 29 pairs, the average frequency of derived states per 
polymorphic site is higher for T— >C or A— >G sites than for corre- 
sponding C — > T or G — > A sites. If biased gene conversion is respon- 
sible for these SFS shifts, the magnitude of the shifts should be similar 
for all codon pairs (at least within the A/G or C/T class). There is, 
however, considerable variation among codon pairs in relative codon 
usage and the difference in derived states/site (Table 5). For C/T pairs, 
two-way analysis of variance [ANOVA; codon pair by direction 
(C— >T vs. T— >C)] indicated highly significant effects of codon pair 
(^15,12681 = 3.859, P < lO" 6 ) and direction (F hl268 i = 730.5, P < 
10~ 15 ), as well as a highly significant interaction effect (F 15>12 68i - 
3.250, P < 10~ 4 ). Similar results were obtained for A/G pairs (site 
type: ^12,9514 = 2.841, P < 10" 4 ; direction: F W514 = 576.1, P < 10" 15 ; 
interaction: ^12,9514 = 5.607, P < 10" 8 ). For the C/T pairs, the differ- 
ence in mean d correlates moderately with the difference in the pref- 
erence scores between the C- and T- ending codons, although not well 
with the degree of bias (Figure 5, A and B). For the A/G pairs, the 
difference correlates well with degree of bias, and somewhat with the 
difference in G- and A-ending codon preference scores (Figure 5, C 
and D). 

It is worth contrasting the SFS shifts for codons to those of introns, 
as shifts in SFSs may provide insight into composition -biasing process, 
such as biased gene conversion. Using D. lowei to infer the ancestral 
states, the SFSs for all 12 possible changes were obtained. Of particular 
note are those for C— >T, T— >C, G— >A, and A— >G, because these 
mirror the changes discussed previously for codons. The differences in 
mean derived states/site, although highly significant, are not as pro- 
nounced for introns as they are for synonymous sites. For C — > T and 
T— >C, mean d was 1.977 and 2.802, respectively (a difference of 
0.825). This contrasts with means of 2.231 and 3.582, respectively, 
for all C/T- segregating codons (a difference of 1.351). Two-way 
ANOVA on d (Table 5) indicated highly significant effects of site type 
(intron vs. codon) and direction, as well as a site type x direction 
interaction. Similarly, for G— >A and A— >G intron sites, mean d was 



Table 5 Analysis of variance (site type x direction) 
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d.f., degrees of freedom; SS, sum of squares; MS, mean squares, 
k Site type can be intron or codon third position. 
Direction can be, for example, C^T or T^C. 



1.939 and 2.885, respectively (a difference of 0.946). This contrasts 
with means of 2.187 and 3.614 for all G/A-segregating codons 
(a difference of 1.427). Again, two-way ANOVA on d indicated highly 
significant effects of site type and direction, as well as a strong site 
type x direction interaction. The site type x direction interaction 
effects indicate significantly larger shifts in codons relative to introns, 
as expected if the SFS shifts in codons are due to selection, and not 
only a composition-biasing influence shared by all sites, such as G/C- 
biased gene conversion. 

It is worth noting that restricting the analysis to the much smaller 
subset of sites in small introns does not qualitatively affect the results. 
Following Halligan and Keightley (2006), who proposed that short 
introns were less constrained than longer introns, sites were restricted 
to introns of 80 bp or shorter, excluding the first nine and last eight 
bases adjacent to splice junctions. There were 2414 C<->T sites and 
1544 G<->A sites. The SFS shift for C<->T sites was slightly reduced 
(0.645), whereas the SFS shift for G<->A sites (0.935) was essentially 
unchanged. 

Mean d in introns was nearly identical for G— >C and C— >G 
(2.341 and 2.375, respectively) and for A^T and T^A (2.117 and 
2.174, respectively); neither difference was statistically significant. 
Two-way ANOVA indicated only significant effects of site type (in- 
tron vs. codon); of note, there was no significant interaction effect (nor 
was there a significant direction effect). Intermediate results were 
obtained for G— >T vs. T— >G and C— >A vs. A— >C (both favoring 
changes toward G or C); all three effects in the ANOVA were signif- 
icant. The general implication is that the forces shifting SFSs for 
synonymous codon pairs are stronger than those for introns, consis- 
tent with previous observations in Drosophila (De Proce et al. 2012). 
Therefore, although there may be some effect of biased gene conver- 
sion on codon usage, this is insufficient to explain our observations. 



G3'Genes | Genomes | Genetics 



Volume 4 April 2014 I Varying Selection on Codon Usage I 687 



A 


2.5 


C>T 


2.0 


I 




2 




f 


IS 




l 0 


1 


1 


J 5 




0.0 



• a or-. 

• RLS-4f 

• lS-2f 
♦NDCMPT 



O.SO 0.55 0 60 0 65 0 70 0 75 0.80 
C/(OT) 



B 


2.5 


C>T 


2.0 


I 








s 


: 5 


V 


: Q 


H 

1 


1 


j 5 




0.0 



C 25 



< 

A 
(J 



19 

A 
< 



0.0 



D 2.5 



2.0 



1.5 



1.0 



0.5 



0.0 



♦ 



•agptv 

♦ RLS-4f 

♦ lS-2f 

♦ NDCHPT 



OS 06 0.7 08 09 10 
pref C ending - pref T ending 



l : 



•v 



• AGPTV 

• RLS-4f 

♦ RL-2f 

♦ EQK 



0.3 0.4 0.5 0.6 0.7 0.8 

G / (G+A) 



0.9 



V 



• AGPTV 

• RLS-4f 

♦ RL-2f 

♦ EQK 



0.0 0.2 0.4 0.6 0.8 1.0 1.2 
pref G-ending - pref A-ending 



1.4 



Figure 5 Shifts in site frequency spectra among codon pairs. (A) 
Difference in average frequency of derived states/polymorphic site for 
C/T codon pairs relative to codon usage {i.e., proportion of C-ending 
codons). (B) Difference in average frequency of derived states/site for 
C/T codon pairs relative to Apref for T— >C changes. (C, D) Corre- 
sponding figures for G/A codon pairs. Letters in legend correspond 
to single-letter amino acid codes; blue, fourfold degenerate amino 



DISCUSSION 

Our analyses corroborate the likely influence of natural selection on 
codon usage in Drosophila, specifically D. pseudoobscura. We also 
observe, as expected, a positive correlation between diversity and re- 
combination rate. As shown by Kulathinal et al. (2008), this associa- 
tion is stronger when recombination is estimated at a finer scale. 
However, in contrast to analyses on D. melanogaster, we observe no 
significant association of codon bias with recombination rate, despite 
considerable statistical power and reliable estimates of recombination 
rate at the finer scale advocated by Kulathinal et al. (2008). Evidence 
from polymorphism data for reduced natural selection on codon us- 
age in areas of low recombination in our dataset is limited at best. It is 
possible that recombination rate is recently reduced in these areas, 
such that AT e -reducing effects on diversity of linkage (especially by 
positive selection on recent beneficial mutations) are apparent, but 
the N e remains sufficient for effective selection. Changes in codon bias 
would not become apparent for some time, given that this requires 
accumulation of synonymous substitutions. 

Both findings indicate that recombination rate across chromosome 
2 exceeds the threshold necessary for effective natural selection. 
Recombination rates in D. pseudoobscura are generally higher than 
those estimated for D. melanogaster (Comeron et al 2012). In 
D. melanogaster, 21% of intervals had recombination rate estimates 
below 0.5 cM/Mb, in contrast with 6% in D. pseudoobscura. Although 
59% of intervals had recombination rates above 1.5 cM/Mb in 
D. melanogaster (56% for autosomal regions only), this value was 
exceeded in 82% of intervals in D. pseudoobscura. Furthermore, while 
the local N e could effectively vary across the chromosome due to the 
iV e -reducing effect of selection on linked sites in areas of lower re- 
combination, the generally higher N e in D. pseudoobscura may miti- 
gate the Hill- Robertson effect to some extent. That is, the product of 
N e and 5 may be sufficient over most of the chromosome for selection 
to effectively fix the optimal codons. 

Distortion of the SFS 

The shape of the overall SFS for the 32,729 synonymous polymorphic 
sites clearly differs from that expected for neutral variants in 
a constant-N population. First, the proportion of derived singletons 
(55.1%) is much higher than the 34.1% expected in a population of 
constant size (Figure 6). Second, there is a raised "tail" in the SFS, with 
more sites having 10 derived states than 9 derived states. 

Biological explanations for an excess of singletons include purifying 
background selection (Charlesworth et al. 1993) and population expan- 
sion (Tajima 1989a), both of which distort the coalescent of a popula- 
tion with constant N e to increase the relative lengths of branches upon 
which mutations would be observed as singletons (Tajima 1989a,b). 
Background selection or a demographic influence on the SFS should 
affect U — > P and P — > U sites similarly, but we observe a significantly 
stronger shift toward low-frequency derived states in P— >U sites. 
Therefore, while both influences may be at work in D. pseudoobscura, 
they are not sufficient to explain our observations. 

In addition to demographic effects, an excess of singletons can 
arise, in principle, from sequencing error. With a low value of 6, most 
sites in a small sample (here, k = 11) would be invariant. A single error 

acids; light blue, codon pair from fourfold degenerate subclass of 
sixfold degenerate amino acids; red, codon pair from isoleucine or 
twofold degenerate subclass of sixfold degenerate amino acids; gold, 
twofold degenerate amino acids. Dashed lines correspond to linear 
regression through all points. 
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Figure 6 Site frequency spectra corrected for sequencing error and 
ancestral state misassignment (ASM). Expected proportions under 
a constant-N e Wright-Fisher neutral model are shown in black; our 
data, assuming parsimony, are shown in blue. (A) Correction for 
ASM based on observed levels of diversity and divergence (following 
□opart et a/. (2008)). (B) Correction for ASM with a 0.1% sequencing 
error rate. (C) Correction for ASM with a 0.54% error rate. 

at an invariant site would produce an apparent singleton. Additional 
errors would add to the remainder of the SFS, but assuming that 
errors are independent, the impact would be seen mainly on single- 
tons. Assuming a 0.1% error rate (equivalent to a phred consensus 
quality score of 30), following the binomial distribution, 98.91% of 
truly invariant sites would be observed as invariant. However, 1.09% 
of invariant sites would be apparent singletons, whereas only 5.5 x 
10~ 3 % would appear to have two derived states (assuming that all 
errors produce the same character state). Likewise, 99.00% of true 
singletons would be observed as singletons, whereas 0.99% would 
present as having two derived states {i.e., be observed as "doubletons") 



and 4.4 x 10~ 3 % would present as having three derived states. This 
approach can be extended to all possible values for true and observed 
derived states {i.e., for a true invariant site appearing to have 1,2,... 10 
derived states; for a true singleton appearing to have 2, 3... 10 derived 
states; and so on), ultimately leading to 42.3% of sites having one 
apparent derived state. This is well below the observed value of 
55.1%, which would require an error rate of approximately 0.33%. 
Given the requirements of consensus quality scores of 30 or better, 
a minimum of 15 X coverage in highly inbred strains, and full reso- 
lution of all three codon positions in every strain (including the out- 
group), sequencing error alone does not explain the high proportion 
of singletons observed. 

It is further unlikely that the resequencing using the Illumina 
platform is leading to an accumulation of false As and Ts. Across 
a range of genomic G+C contents, Nakamura et at. (2011) found that 
A/T— >G/C errors were more likely than G/C— ► A/T errors. Although 
Illumina sequencing is more prone to base call errors than either ABI 
SOLiD or Roche 454 (Liu et al. 2012), error bias is unlikely to explain 
our observed shifts in SFSs. Most A/T — > G/C polymorphic sites reflect 
U — > P changes in D. pseudoobscura yet the proportion of singletons is 
much lower for U^P sites (43.1%) than it is for P^U sites (60.7%; 
Figure 4). 

SFS differences among codon pairs 

The SFS data are consistent with an influence of natural selection on 
codon usage in D. pseudoobscura, corroborating previous studies 
(Akashi and Schaeffer 1997; Haddrill et al. 2011). Additionally, the 
data indicate that intensity of selection varies among synonymous 
mutations. Vicario et al. (2007) previously proposed this possibility 
in a comparison of codon usage in the genomes of 12 Drosophila 
species. Although there is a correspondence between codon bias and 
SFS shifts consistent with differential selection, the nature of that 
differential selection is uncertain. Selection does not appear to be 
strongest on the more common amino acids within a degeneracy class, 
as might be expected for selection on efficiency of translation. For 
example, phenylalanine and tyrosine are used at an intermediate level 
among the C/T- ending twofold degenerate amino acids, despite show- 
ing the strongest SFS shifts in this class. Alanine is the most com- 
monly used fourfold degenerate amino acid, but its SFS shifts for 
C<->T and A<->G changes are intermediate within the degeneracy 
class. On the other hand, for sixfold degenerate amino acids, leucine 
and serine are used more often than arginine (and are among the most 
used amino acids overall) and show much stronger SFS shifts for 
changes in the fourfold degenerate subsets; in fact, their SFS shifts 
are among the strongest observed in this analysis. Vicario et al. (2007) 
proposed ad hoc explanations for stronger selection on some amino 
acids {e.g., for accurate translation of disulfide bridge-forming cysteine 
or for accurate and efficient translation of heavily used hydrophobic 
leucine). Potential influences of isoaccepting tRNA pools on codon 
bias have been identified for Drosophila (Moriyama and Powell 1997; 
Powell and Moriyama 1997), but these authors note that pools likely 
change over time within an individual (White et al. 1973) and that this 
plasticity may itself influence codon bias among amino acids. Further- 
more, the complement of tRNA genes likely changes over evolution- 
ary time in Drosophila, with evidence of numerous gains, losses, and 
reassignments (Rogers et al. 2010). The observed slightly stronger shifts 
for A <-> G vs. C <-> T synonymous changes may reflect differences in 
composition -biasing influences, as reflected in corresponding SFS shifts 
for introns. It is, therefore, probably premature to speculate too exten- 
sively on the bases of differential selection among amino acids. 
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Table 6 Summary data for C/T and G/A segregating and fixed different codon third positions 



Codon Pair 


Amino Acid 


C 3 or G 3 a 


Apref T A->cG b 


S/N (CG) C 


S/N (TA) d 


dcG->TA '■ djA-> 


GC C/T 


Ala 


0.757 


0.972 


684/24,976 


193/8,275 


2.251:3.539 


GG C/T 


Gly 


0.747 


0.652 


574/19,504 


226/7,027 


2.336:3.319 


CC C/T 


Pro 


0.774 


0.829 


428/14,444 


97/4,144 


2.185:3.557 


AC C/T 


Thr 


0.724 


0.792 


458/15,326 


104/6,117 


2.216:3.808 


GT C/T 


Val 


0.663 


0.737 


355/12,959 


98/7,01 1 


2.352:3.622 


CG C/T 


Arg4 


0.692 


0.591 


509/14,398 


194/7,044 


2.369:3.170 


CT C/T 


Leu4 


0.728 


0.774 


525/14,629 


116/5,869 


2.051:3.776 


TC C/T 


Ser4 


0.743 


0.638 


436/14,149 


150/5,164 


1.959:4.093 


AT C/T 


lie 


0.601 


0.929 


753/21.090 


241/15,310 


2.112:3.560 


AG C/T 


Ser2 


0.708 


0.567 


468/16,090 


223/7,353 


2.348:3.090 


AAC/T 


Asn 


0.552 


0.922 


728/21,727 


553/19,956 


2.404:3.562 


GA C/T 


Asp 


0.505 


0.784 


684/21,184 


618/24.206 


2.367:3.565 


TG C/T 


Cys 


0.723 


0.608 


338/10,916 


162/4,458 


2.346:3.525 


CA C/T 


His 


0.578 


0.662 


338/11,373 


255/9,781 


1.976:3.333 


TT C/T 


Phe 


0.629 


1.064 


984/20,878 


372/13,361 


2.181:4.024 


TAC/T 


Tyr 


0.621 


0.842 


511/15,979 


338/11,392 


2.160:3.781 


GC G/A 


Ala 


0.522 


0.567 


278/8.438 


161/8,691 


2.392:3.503 


GG G/A 


Gly 


0.361 


0.208 


229/4,821 


201/9,551 


2.655:2.960 


CC G/A 


Pro 


0.557 


0.653 


308/8,910 


181/8,456 


2.360:3.508 


AC G/A 


Thr 


0.571 


0.620 


408/12,025 


181/10,085 


2.368:3.320 


GT G/A 


Val 


0.849 


0.942 


631/23,036 


113/4,307 


2.019:3.894 


CG G/A 


Arg4 


0.566 


0.442 


217/5,828 


115/5,398 


2.618:3.104 


CT G/A 


Leu4 


0.874 


1.140 


959/34,643 


169/5,649 


2.088:4.172 


TC G/A 


Ser4 


0.730 


0.684 


399/12,539 


102/5,237 


2.120:3.941 


AG G/A 


Arg2 


0.574 


0.283 


145/4,236 


70/3,394 


2.159:3.171 


TT G/A 


Leu2 


0.827 


0.138 


334/12,658 


74/2,788 


2.135:3.824 


GA G/A 


Glu 


0.722 


1.448 


1323/38,468 


408/16,383 


2.225:3.733 


CA G/A 


Gin 


0.753 


1.312 


802/28,876 


232/10,386 


2.170:3.608 


AA G/A 


Lys 


0.723 


1.400 


1165/37,359 


335/14,953 


2.039:3.878 



k Usage of the codon ending in C or G for a C/T or G/A codon pair, respectively. 
Zipref for a substitution of a T- or A-ending codon with the corresponding C- or G-ending codon. 

S, frequency of polymorphic sites with C or G as the ancestral state; N, frequency of sites with C or G as the ancestral state; frequencies are reported only for sites 
^ that are fully resolved at all three codon positions in all D. pseudoobscura and D. lowei sequences. 
N and S for sites with T or A as the ancestral state (see c). 

Mean frequency of derived states per site; CG— >TA, ancestral state ends with either C or G; TA^CG, ancestral state ends with either T or A. 



That this main result is not an artifact of sequencing error is 
reinforced by differences among codon pairs in the proportion of 
derived singletons and the proportion of polymorphic sites. Variation 
in the proportion of derived singletons among all 16 ancestrally 
C-ending codons is not quite significant (G = 24.91, 15 d.f., P = 
0.0511), and there is no significant variation for ancestral T-ending 
codons (G = 14.85, 15 d.f., P = 0.462). However, variation in the 
proportion of derived singletons is significant for both ancestral 
G-ending (G = 21.42, 12 d.f., P = 0.0445) and A-ending codons 
(G = 22.98, 12 d.f, P = 0.0279). 

For C/T codon pairs, 3.25% of ancestral C-ending codons are 
polymorphic, whereas 2.51% of ancestral T-ending codons are 
polymorphic; this difference is highly significant (G test of indepen- 
dence, G = 189.4, 1 d.f., P < 10" 15 ). Although this finding could 
indicate a higher probability of C — > T errors (which is not likely with 
Illumina platforms; see Distortion of the SFS) y there is considerable 
variation among C/T codon pairs in the proportion of polymorphic 
ancestral C-ending codons (G test of independence: G = 199.3, 15 d.f., 
P < 10~ 15 ) and polymorphic T-ending codons (G = 197.4, 15 d.f., 
P < 10~ 15 ). Similar results were obtained for G/A codon pairs; 3.13% of 
ancestral G-ending codons were polymorphic and 2.22% of ancestral 
A-ending codons were polymorphic (G = 224.5, 1 d.f., P < 10~ 15 ) . 
However, the proportions varied among codon pairs (ancestral 
G-ending: G = 102.2, 12 d.f, P < 10" 15 ; ancestral A-ending: G = 42.7, 
12 d.f., P < 10~ 4 ) (see Table 6 for proportions). Examining codon 



pairs from fourfold degenerate sites only, G tests remained significant 
for all four ancestral bases (all P- values below 10~ 4 ). Likewise, for 
twofold degenerate sites only, G tests were highly significant for an- 
cestral C-ending and T-ending codons (P < 10~ 15 ) and for ancestral 
G-ending codons (P = 0.00012); however, there was no significant 
heterogeneity among codon pairs for ancestral A-ending codons 
(P = 0.403). Thus, there is considerable variation in the proportion 
of polymorphic codons, even within similar degeneracy classes, a result 
that does not support a major role for sequencing error, but is con- 
sistent with differences among codon pairs in the influence of weak 
selection. 

Impact of ASM 

The raised tail of the SFS may be explained by ASM, especially if the 
SFS is already left-shifted. Under a model with constant N e , we expect 
a ratio of 10:9 for sites with 9 vs. 10 derived states. However, if we 
allow ASM with a probability of 1/(LR+1), where LR is the likelihood 
ratio in Equation 1, we can solve for the value of divergence required 
for a given value of 6 that would lead to a 1:1 ratio by (a) factoring in 
sites with 1 or 2 derived states that present as having 10 or 9 derived 
states and (b) factoring out sites with 10 or 9 derived states that would 
present as having 1 or 2 derived states. For our 6 = 0.0222, a divergence 
of 0.0424 would be sufficient and would lead to 4.1% of sites present- 
ing 9 and 4.1% of sites presenting 10 derived states under a constant- 
N model. While both values exceed our observations, the excess of 
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singletons by necessity decreases the proportion of sites in the re- 
mainder of the remainder of the SFS. We estimated per-site synony- 
mous divergence at 0.0763, which leads to a slightly raised tail (a ratio 
of 1.12 for 10:9 derived states); we observed a ratio of 1.54. 

If we assume a constant-N model, but with ASM and a sequencing 
error rate of 0.1%, we can begin to approach the observed SFS (Figure 
6). However, we still observe an excess of singletons; we still need 
a much higher error rate to approach the observed SFS. In fact, for 
P — > U sites, the error rate would have to be approximately 0.54% to 
produce the extreme left shift. It is not possible to reproduce the SFS 
for U— >P sites, with a slight excess of singletons and a markedly 
raised tail, with sequencing error and ASM alone. 

Analysis of SFS of derived synonymous mutations in D. pseudoobs- 
cura indicates that the intensity of natural selection varies among 
classes of synonymous mutation. The shapes of the SFSs are likely 
shaped by other influences, possibly including ASM and sequencing 
error, but variation among synonymous codon pairs in the extent of 
the SFS shifts support differential intensity of selection. 
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